Jonathan Christyadi (502705) - AI Core 02
This notebook predicts the likelihood that a link is a phishing link or a legitimate one, with a focus on exploring and testing hypotheses that merit further research.
import sklearn
import pandas as pd
import seaborn
import numpy as np
print("scikit-learn version:", sklearn.__version__) # 1.3.0
print("pandas version:", pd.__version__) # 2.0.3
print("seaborn version:", seaborn.__version__) # 0.12.2
scikit-learn version: 1.3.0
pandas version: 2.0.3
seaborn version: 0.12.2
After loading the dataset, I found some inconsistencies in the data. First, the link label (phishing or legitimate) can be converted to binary format. Also, in the domain_with_copyright column, some values are binary and some are spelled out, for example: zero, One, etc.
df = pd.read_csv("Data/dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.head()
| id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://www.progarchives.com/album.asp?id=61737 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 1 | one | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | phishing |
| 1 | 1 | http://signin.eday.co.uk.ws.edayisapi.dllsign.... | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 300 | 65 | 0 | 0 | 1 | 0 | phishing |
| 2 | 2 | http://www.avevaconstruction.com/blesstool/ima... | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | phishing |
| 3 | 3 | http://www.jp519.com/ | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | legitimate |
| 4 | 4 | https://www.velocidrone.com/ | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | legitimate |
5 rows × 87 columns
# Taking a look at the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19431 entries, 0 to 19430
Data columns (total 87 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   id                          19431 non-null  object
 1   url                         19431 non-null  object
 2   url_length                  19431 non-null  object
 3   hostname_length             19431 non-null  object
 4   ip                          19431 non-null  object
 5   total_of.                   19431 non-null  object
 6   total_of-                   19431 non-null  object
 7   total_of@                   19431 non-null  object
 8   total_of?                   19431 non-null  object
 9   total_of&                   19431 non-null  object
 10  total_of=                   19431 non-null  object
 11  total_of_                   19431 non-null  object
 12  total_of~                   19431 non-null  object
 13  total_of%                   19431 non-null  object
 14  total_of/                   19431 non-null  object
 15  total_of*                   19431 non-null  object
 16  total_of:                   19431 non-null  object
 17  total_of,                   19431 non-null  object
 18  total_of;                   19431 non-null  object
 19  total_of$                   19431 non-null  object
 20  total_of_www                19431 non-null  object
 21  total_of_com                19431 non-null  object
 22  total_of_http_in_path       19431 non-null  object
 23  https_token                 19431 non-null  object
 24  ratio_digits_url            19431 non-null  object
 25  ratio_digits_host           19431 non-null  object
 26  punycode                    19431 non-null  object
 27  port                        19431 non-null  object
 28  tld_in_path                 19431 non-null  object
 29  tld_in_subdomain            19431 non-null  object
 30  abnormal_subdomain          19431 non-null  object
 31  nb_subdomains               19431 non-null  object
 32  prefix_suffix               19431 non-null  object
 33  random_domain               19431 non-null  object
 34  shortening_service          19431 non-null  object
 35  path_extension              19431 non-null  object
 36  nb_redirection              19431 non-null  object
 37  nb_external_redirection     19431 non-null  object
 38  length_words_raw            19431 non-null  object
 39  char_repeat                 19431 non-null  object
 40  shortest_words_raw          19431 non-null  object
 41  shortest_word_host          19431 non-null  object
 42  shortest_word_path          19431 non-null  object
 43  longest_words_raw           19431 non-null  object
 44  longest_word_host           19431 non-null  object
 45  longest_word_path           19431 non-null  object
 46  avg_words_raw               19431 non-null  object
 47  avg_word_host               19431 non-null  object
 48  avg_word_path               19431 non-null  object
 49  phish_hints                 19431 non-null  object
 50  domain_in_brand             19431 non-null  object
 51  brand_in_subdomain          19431 non-null  object
 52  brand_in_path               19431 non-null  object
 53  suspecious_tld              19431 non-null  object
 54  statistical_report          19431 non-null  object
 55  nb_hyperlinks               19431 non-null  object
 56  ratio_intHyperlinks         19431 non-null  object
 57  ratio_extHyperlinks         19431 non-null  object
 58  ratio_nullHyperlinks        19431 non-null  object
 59  nb_extCSS                   19431 non-null  object
 60  ratio_intRedirection        19431 non-null  object
 61  ratio_extRedirection        19431 non-null  object
 62  ratio_intErrors             19431 non-null  object
 63  ratio_extErrors             19431 non-null  object
 64  login_form                  19431 non-null  object
 65  external_favicon            19431 non-null  object
 66  links_in_tags               19431 non-null  object
 67  submit_email                19431 non-null  object
 68  ratio_intMedia              19431 non-null  object
 69  ratio_extMedia              19431 non-null  object
 70  sfh                         19431 non-null  object
 71  iframe                      19431 non-null  object
 72  popup_window                19431 non-null  object
 73  safe_anchor                 19431 non-null  object
 74  onmouseover                 19431 non-null  object
 75  right_clic                  19431 non-null  object
 76  empty_title                 19431 non-null  object
 77  domain_in_title             19431 non-null  object
 78  domain_with_copyright       19431 non-null  object
 79  whois_registered_domain     19431 non-null  object
 80  domain_registration_length  19431 non-null  object
 81  domain_age                  19431 non-null  object
 82  web_traffic                 19431 non-null  object
 83  dns_record                  19431 non-null  object
 84  google_index                19431 non-null  object
 85  page_rank                   19431 non-null  object
 86  status                      19431 non-null  object
dtypes: object(87)
memory usage: 12.9+ MB
# Sampling the dataset
df.sample(10)
| id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4525 | 4525 | https://www.documentcloud.org/documents/246219... | 96 | 21 | 1 | 3 | 6 | 0 | 0 | 0 | ... | 1 | zero | 0 | 379 | 4368 | 31458 | 0 | 0 | 6 | legitimate |
| 1938 | 1938 | https://www.simplypsychology.org/psychosexual.... | 50 | 24 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 1964 | 3515 | 10338 | 0 | 0 | 5 | legitimate |
| 13915 | 5914 | https://mail.parkhill.k12.mo.us/owa/auth/logon... | 123 | 23 | 1 | 9 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 1 | 0 | -1 | 105946 | 0 | 1 | 4 | phishing |
| 7115 | 7115 | http://www.online-tech-tips.com/free-software-... | 75 | 24 | 0 | 2 | 6 | 0 | 0 | 0 | ... | 1 | one | 0 | 636 | 4843 | 5881 | 0 | 0 | 5 | legitimate |
| 7892 | 7892 | http://www.true-piano-lessons.com/free-piano-l... | 57 | 26 | 0 | 3 | 4 | 0 | 0 | 0 | ... | 1 | zero | 0 | 169 | -1 | 1003956 | 0 | 0 | 3 | legitimate |
| 8664 | 663 | https://www.ulrc.go.ug/scripts/?cliente=3D6624... | 51 | 14 | 1 | 3 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 421 | 10587379 | 0 | 0 | 3 | phishing |
| 9152 | 1151 | http://www.straighttalkforthesoul.com/ | 38 | 30 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 359 | 1466 | 0 | 0 | 0 | 2 | legitimate |
| 6346 | 6346 | http://safelinknojutsu.blogspot.com/ | 36 | 28 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 373 | 7296 | 1488112 | 0 | 0 | 5 | legitimate |
| 7989 | 7989 | http://www.softpedia.com/get/System/File-Manag... | 76 | 17 | 0 | 3 | 3 | 0 | 0 | 0 | ... | 1 | zero | 0 | 698 | 6242 | 2409 | 0 | 0 | 6 | legitimate |
| 5394 | 5394 | https://20200724065829-dot-s2pe7ed9y.rj.r.apps... | 70 | 45 | 1 | 5 | 2 | 0 | 0 | 0 | ... | 1 | zero | 0 | 228 | 5616 | 0 | 0 | 1 | 5 | phishing |
10 rows × 87 columns
After examining the sample, I found that some columns are not in a good form and there is room for improvement, such as the domain_with_copyright and status columns.
df['status'].unique()
array(['phishing', 'legitimate'], dtype=object)
As you can see, the status column contains only two values, phishing and legitimate, which means I can transform it into binary values (0 and 1).
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
df.head()
| id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://www.progarchives.com/album.asp?id=61737 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 1 | one | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | 1 |
| 1 | 1 | http://signin.eday.co.uk.ws.edayisapi.dllsign.... | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 300 | 65 | 0 | 0 | 1 | 0 | 1 |
| 2 | 2 | http://www.avevaconstruction.com/blesstool/ima... | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | http://www.jp519.com/ | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | https://www.velocidrone.com/ | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | 0 |
5 rows × 87 columns
On closer inspection, I spotted some inconsistencies in the values of the domain_with_copyright column, for example One versus one. As with status, I want to transform them into the binary values 0 and 1 instead of strings.
df['domain_with_copyright'].unique()
array(['one', 'zero', 'One', 'Zero', '1', '0'], dtype=object)
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
df['domain_with_copyright'].unique()
array([1, 0])
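The explicit map above works, but a more defensive variant (a sketch, not part of the original notebook; `to_binary` is a hypothetical helper) normalizes case and whitespace first, so a new spelling such as ONE is still caught, and anything unrecognized surfaces as NaN instead of failing silently:

```python
import pandas as pd

def to_binary(series: pd.Series) -> pd.Series:
    # Normalize to lower-case, stripped strings, then map the known spellings onto 0/1.
    # Unmapped values become NaN, which makes bad data visible downstream.
    mapping = {'one': 1, 'zero': 0, '1': 1, '0': 0}
    return series.astype(str).str.strip().str.lower().map(mapping)

print(to_binary(pd.Series(['one', 'zero', 'One', 'Zero', '1', '0'])).tolist())
# [1, 0, 1, 0, 1, 0]
```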
# Count the missing values in each column (isna and isnull are aliases in pandas)
total_na = df.isna().sum()
total_null = df.isnull().sum()
total_null.sum()
0
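Since `isna` and `isnull` are aliases, the two totals above are guaranteed to be identical and one call suffices. A quick way to report only the columns that actually contain gaps (a sketch on a toy frame, not the notebook's data):

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, None, 3], 'b': [1, 2, 3]})  # toy frame with one gap
na_counts = df_demo.isna().sum()
print(na_counts[na_counts > 0])  # only columns with at least one missing value
```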
Making a function to check which features contain binary values.
# Finding columns with binary values
def count_binary_columns(df):
    results = []
    counter = 0
    for col in df.columns:
        counter += 1
        if df[col].isin([0, 1]).all():
            results.append(col)
    return results, counter
count_binary_columns(df)
(['domain_with_copyright', 'status'], 87)
df = df.drop(columns=['id', 'url'])
df.head()
| url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | total_of= | total_of_ | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | 1 |
| 1 | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 300 | 65 | 0 | 0 | 1 | 0 | 1 |
| 2 | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | 1 |
| 3 | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | 0 |
| 4 | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | 0 |
5 rows × 85 columns
df['whois_registered_domain'].unique()
array(['0', '1'], dtype=object)
print(df['status'].value_counts())
df['status'].value_counts().plot(kind='bar', title='Count the target variable')
status
0    9716
1    9715
Name: count, dtype: int64
<Axes: title={'center': 'Count the target variable'}, xlabel='status'>
A heatmap will be used to select a suitable set of features for predicting the status target. At this stage I have no prior idea which features to use, so I rely on the heatmap to find the features most correlated with the target. First, I visualize the correlations among the features.
import seaborn as sns
import matplotlib.pyplot as plt
# The columns were read in as strings (dtype='unicode'), so coerce them to numeric before correlating
corr = df.apply(pd.to_numeric, errors='coerce').corr()
plt.figure(figsize=(100, 100))
plot = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=2)
To select the most suitable features for predicting the target variable (status), a heatmap was created to visualize the correlation between the features. By analyzing the heatmap, we can identify the features that have the highest positive or negative correlation with the target variable.
Now I want to make a bar plot of the correlations with the target variable, which helps me identify the important features, understand the relationships, and simplify the model.
# Sorting the correlation values with the target variable in descending order
corr.drop('status').sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
<Axes: title={'center': 'Correlation with the target variable'}>
The bar plot shows that there are a lot of features; I want to narrow them down to those with the strongest correlations in numerical terms.
# Finding the features most correlated with the target variable among the numeric features, excluding NaN values
correlation_matrix = df.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)
sorted_corr
| url_length | hostname_length | total_of. | total_of- | total_of? | total_of/ | total_of_www | ratio_digits_url | phish_hints | nb_hyperlinks | domain_in_title | domain_with_copyright | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| status | 0.244348 | 0.240681 | 0.205302 | -0.102849 | 0.293920 | 0.240892 | -0.444561 | 0.356587 | 0.337287 | -0.341295 | 0.339519 | -0.175469 | 0.730684 | -0.509761 | 1.000000 |
| google_index | 0.233061 | 0.216919 | 0.208764 | -0.018285 | 0.202097 | 0.289212 | -0.357215 | 0.323157 | 0.279906 | -0.269482 | 0.265933 | -0.144499 | 1.000000 | -0.386721 | 0.730684 |
| ratio_digits_url | 0.434626 | 0.171761 | 0.224194 | 0.110341 | 0.325739 | 0.206925 | -0.211165 | 1.000000 | 0.096967 | -0.128915 | 0.152393 | -0.027357 | 0.323157 | -0.181489 | 0.356587 |
| domain_in_title | 0.124224 | 0.218850 | 0.108442 | 0.009843 | 0.092191 | 0.088462 | -0.178402 | 0.152393 | 0.125857 | -0.217548 | 1.000000 | 0.076105 | 0.265933 | -0.332742 | 0.339519 |
| phish_hints | 0.332000 | -0.019901 | 0.168765 | 0.065562 | 0.208052 | 0.501321 | -0.090812 | 0.096967 | 1.000000 | -0.112423 | 0.125857 | -0.066130 | 0.279906 | -0.203464 | 0.337287 |
| total_of? | 0.523172 | 0.164129 | 0.353133 | 0.035958 | 1.000000 | 0.243749 | -0.115337 | 0.325739 | 0.208052 | -0.112604 | 0.092191 | -0.046123 | 0.202097 | -0.123151 | 0.293920 |
| url_length | 1.000000 | 0.217586 | 0.447198 | 0.406951 | 0.523172 | 0.486490 | -0.067973 | 0.434626 | 0.332000 | -0.098101 | 0.124224 | -0.004281 | 0.233061 | -0.099900 | 0.244348 |
| total_of/ | 0.486490 | -0.061203 | 0.242216 | 0.204793 | 0.243749 | 1.000000 | -0.005628 | 0.206925 | 0.501321 | -0.073183 | 0.088462 | -0.023213 | 0.289212 | -0.113861 | 0.240892 |
| hostname_length | 0.217586 | 1.000000 | 0.406834 | 0.059480 | 0.164129 | -0.061203 | -0.130991 | 0.171761 | -0.019901 | -0.104614 | 0.218850 | 0.073107 | 0.216919 | -0.160621 | 0.240681 |
| total_of. | 0.447198 | 0.406834 | 1.000000 | 0.049303 | 0.353133 | 0.242216 | 0.068290 | 0.224194 | 0.168765 | -0.093994 | 0.108442 | 0.057320 | 0.208764 | -0.098752 | 0.205302 |
| total_of- | 0.406951 | 0.059480 | 0.049303 | 1.000000 | 0.035958 | 0.204793 | 0.045756 | 0.110341 | 0.065562 | -0.004513 | 0.009843 | 0.020914 | -0.018285 | 0.104676 | -0.102849 |
| domain_with_copyright | -0.004281 | 0.073107 | 0.057320 | 0.020914 | -0.046123 | -0.023213 | 0.087826 | -0.027357 | -0.066130 | 0.192159 | 0.076105 | 1.000000 | -0.144499 | 0.057127 | -0.175469 |
| nb_hyperlinks | -0.098101 | -0.104614 | -0.093994 | -0.004513 | -0.112604 | -0.073183 | 0.114259 | -0.128915 | -0.112423 | 1.000000 | -0.217548 | 0.192159 | -0.269482 | 0.221066 | -0.341295 |
| total_of_www | -0.067973 | -0.130991 | 0.068290 | 0.045756 | -0.115337 | -0.005628 | 1.000000 | -0.211165 | -0.090812 | 0.114259 | -0.178402 | 0.087826 | -0.357215 | 0.110745 | -0.444561 |
| page_rank | -0.099900 | -0.160621 | -0.098752 | 0.104676 | -0.123151 | -0.113861 | 0.110745 | -0.181489 | -0.203464 | 0.221066 | -0.332742 | 0.057127 | -0.386721 | 1.000000 | -0.509761 |
The left side lists the feature names and the right side the correlation values, which indicate the strength and direction of the correlation between each feature and the target variable.
# Get all the correlated features with the target variable
num_features = len(sorted_corr['status']) # 15 features
sorted_corr['status'].head(num_features)
status                   1.000000
google_index             0.730684
ratio_digits_url         0.356587
domain_in_title          0.339519
phish_hints              0.337287
total_of?                0.293920
url_length               0.244348
total_of/                0.240892
hostname_length          0.240681
total_of.                0.205302
total_of-               -0.102849
domain_with_copyright   -0.175469
nb_hyperlinks           -0.341295
total_of_www            -0.444561
page_rank               -0.509761
Name: status, dtype: float64
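The same shortlist can also be derived programmatically from the correlation column instead of being copied by hand (a sketch on a toy Series; the 0.1 cutoff is my assumption, roughly matching the weakest feature kept here):

```python
import pandas as pd

# Toy stand-in for sorted_corr['status']
corr_with_target = pd.Series(
    {'status': 1.0, 'google_index': 0.73, 'page_rank': -0.51, 'port': 0.01}
)
cutoff = 0.1
selected = (
    corr_with_target.drop('status')        # never select the target itself
    .loc[lambda s: s.abs() >= cutoff]      # keep strong correlations of either sign
    .index.tolist()
)
print(selected)  # ['google_index', 'page_rank']
```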
Now I can feed the most correlated features (excluding the target variable itself) into the models.
# List the features from the previous step into a list
selected_features = ['google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/','hostname_length','total_of.', 'total_of-','domain_with_copyright','nb_hyperlinks','total_of_www','page_rank']
df[selected_features] = df[selected_features].apply(pd.to_numeric, errors='coerce')
# Check the data types of the selected columns after conversion
print(df[selected_features].dtypes)
# Check if 'status' column exists and has categorical or numerical data
print(df['status'].dtype)
# Create a DataFrame with the selected columns
selected_df = df[selected_features + ['status']]
selected_df.head()
google_index             int64
ratio_digits_url         float64
domain_in_title          int64
phish_hints              int64
total_of?                int64
url_length               int64
total_of/                int64
hostname_length          int64
total_of.                int64
total_of-                int64
domain_with_copyright    int32
nb_hyperlinks            int64
total_of_www             int64
page_rank                int64
dtype: object
int64
| google_index | ratio_digits_url | domain_in_title | phish_hints | total_of? | url_length | total_of/ | hostname_length | total_of. | total_of- | domain_with_copyright | nb_hyperlinks | total_of_www | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.108696 | 1 | 0 | 1 | 46 | 3 | 20 | 3 | 0 | 1 | 143 | 1 | 5 | 1 |
| 1 | 1 | 0.054688 | 1 | 2 | 0 | 128 | 3 | 120 | 10 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0.000000 | 1 | 0 | 0 | 52 | 4 | 25 | 3 | 0 | 0 | 3 | 1 | 0 | 1 |
| 3 | 0 | 0.142857 | 1 | 0 | 0 | 21 | 3 | 13 | 2 | 0 | 1 | 404 | 1 | 0 | 0 |
| 4 | 0 | 0.000000 | 0 | 0 | 0 | 28 | 3 | 19 | 2 | 0 | 0 | 57 | 1 | 4 | 0 |
# Find which of the selected features are binary (the function also returns the total column count)
features_binary = count_binary_columns(df[selected_features])
features_binary
(['google_index', 'domain_in_title', 'domain_with_copyright'], 14)
Scaling the data was considered at this point but is left commented out; the tree-based models used below do not require it.
# from sklearn.preprocessing import StandardScaler
# # Scale the data
# selected_df = selected_df.dropna()
# scaler = StandardScaler()
# selected_df[selected_features] = scaler.fit_transform(selected_df[selected_features])
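If scaling were switched on, the scaler should be fit on the training split only, otherwise test-set statistics leak into training. A leak-free sketch on toy arrays (not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0]])

scaler = StandardScaler().fit(X_train_toy)      # fit on train only
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)    # reuse the train statistics
print(X_train_scaled.mean())                    # 0.0 by construction
```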
Visualize the correlations, distributions, and patterns between multiple variables in the dataset.
# Create pairplot
sns.pairplot(selected_df, hue='status', palette='Set1')
# Add a legend (hue values sort as 0 = legitimate, 1 = phishing)
plt.legend(title='Status', labels=['Legitimate', 'Phishing'])
# Show the plot
plt.show()
In this section I want to split the target and feature variables into X and y.
target = 'status'
X = df[selected_features]
y = df[target]
Splitting into train and test sets, 80% and 20% respectively, so around 15.5k observations end up in the train set and about 3.9k in the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 19431 observations, of which 15544 are now in the train set, and 3887 in the test set.
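Because no `random_state` is passed above, the split (and every score that follows) changes on each run. A reproducible, class-balanced variant would look like this (a sketch on toy data; the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)  # balanced toy labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2,
    random_state=42,   # makes the split reproducible
    stratify=y_toy     # preserves the class ratio in both splits
)
print(len(X_tr), len(X_te))  # 8 2
```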
In this section, I try a few different models and compare how they perform against one another. At the end, I will also stack some of the models.
This code trains a Support Vector Machine (SVM) classifier, a powerful algorithm used for classification tasks. The SVM learns to classify data points into different categories based on their features.
# SUPPORT VECTOR MACHINE SVM
from sklearn.svm import SVC
SVM = SVC()
SVM.fit(X_train, y_train)
SVM_score = SVM.score(X_test, y_test)
print("Accuracy:", SVM_score)
Accuracy: 0.8350913300746077
This code generates a classification report for the predictions made by a Support Vector Machine (SVM) model.
from sklearn.metrics import classification_report
predictions = SVM.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
precision recall f1-score support
0 0.85 0.81 0.83 1915
1 0.82 0.86 0.84 1972
accuracy 0.84 3887
macro avg 0.84 0.83 0.83 3887
weighted avg 0.84 0.84 0.83 3887
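Alongside the report, a confusion matrix makes the error types explicit; for phishing detection, false negatives (phishing classed as legitimate) are usually the costlier cell. A minimal sketch on toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [1, 0, 1, 1, 0]   # toy labels (1 = phishing)
y_pred_toy = [1, 0, 0, 1, 0]   # toy predictions with one false negative
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)  # rows = true class, columns = predicted class
```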
This code trains a Linear Regression model, a simple method for predicting numeric values from input features. Because it is a regressor, its .score method returns R² rather than accuracy.
# LINEAR REGRESSION
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(X_train, y_train)
linear_score = linear.score(X_test, y_test)
print("R²:", linear_score)
R²: 0.6845519239341518
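Since regression output is continuous, it can be made comparable with the classifiers by thresholding the predictions at 0.5 and measuring accuracy (a sketch on toy data; the 0.5 cutoff is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy separable data standing in for the phishing features/labels
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])

reg = LinearRegression().fit(X_toy, y_toy)
y_hat = (reg.predict(X_toy) >= 0.5).astype(int)  # threshold the continuous output
accuracy = (y_hat == y_toy).mean()
print(accuracy)  # 1.0 on this toy set
```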
This code implements the K-Nearest Neighbors (KNN) classification algorithm. KNN works by finding the 'k' nearest data points in the training set to a given input, and the majority class among those neighbors is assigned to the input.
# K-NEAREST NEIGHBORS
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(X_train, y_train)
KNN_score = KNN.score(X_test, y_test)
print("Accuracy:", KNN_score)
Accuracy: 0.9024954978132236
This code trains a decision tree classifier, a type of machine learning model used for classification tasks. Then, it evaluates the accuracy of the model on test data and prints the accuracy score.
# DECISION TREE
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300)
decision_tree.fit(X_train, y_train)
DT_score = decision_tree.score(X_test, y_test)
print("Accuracy:", DT_score)
Accuracy: 0.9313094931824029
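A fitted tree also exposes `feature_importances_`, which ranks features by how much impurity they remove; this gives a cheap cross-check on the correlation-based selection. A sketch on toy data (not the notebook's fitted tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the first column fully determines the label, the second is noise
rng = np.random.default_rng(0)
X_toy = np.column_stack([np.repeat([0, 1], 50), rng.random(100)])
y_toy = np.repeat([0, 1], 50)

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
print(tree.feature_importances_)  # the informative first feature dominates
```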
This code visualizes the decision tree model graphically. It sets up display names for the classes; since status was encoded with legitimate = 0 and phishing = 1, the class names must be listed in that order.
target_names = ["legitimate", "phishing"]  # order matches the encoded classes 0 and 1
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
from sklearn.tree import plot_tree
plot_tree(decision_tree, fontsize=8, feature_names=selected_features, class_names=target_names)
plt.show()
Boosting the decision tree with AdaBoost raises the score by about 2%. Note that AdaBoostRegressor is used here, so the score below is R² rather than classification accuracy.
# AdaBoost with decision trees
from sklearn.ensemble import AdaBoostRegressor
adaboost_decision_tree = AdaBoostRegressor(estimator=decision_tree, n_estimators=50, random_state=21)
X_train = X_train.astype(float)
y_train = y_train.astype(float)
adaboost_decision_tree.fit(X_train, y_train)
ada_dt_score = adaboost_decision_tree.score(X_test, y_test)
print("R² score:", ada_dt_score)
R² score: 0.944418199439675
This code uses Random Forest, which builds a strong model by combining many decision trees. It trains the model on the training data and evaluates it on the test data; since RandomForestRegressor is used, the reported score is R².
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor(n_estimators = 500, max_depth=25, n_jobs=-1)
random_forest.fit(X_train, y_train)
rf_score = random_forest.score(X_test, y_test)
print("R² score:", rf_score)
R² score: 0.9399239597187502
This code applies AdaBoost on top of the random forest to boost its performance. It trains the boosted model on the training data and evaluates it on the test data, displaying the R² score.
# AdaBoost with Random Forest
from sklearn.ensemble import AdaBoostRegressor
adaboost_random_forest = AdaBoostRegressor(estimator=random_forest, n_estimators=50, random_state=21)
adaboost_random_forest.fit(X_train, y_train)
ada_rf_score = adaboost_random_forest.score(X_test, y_test)
print("R² score:", ada_rf_score)
R² score: 0.9576248392484179
This code combines several of the previous models (linear regression, random forest, and the AdaBoost ensembles) into a single Stacking Regressor. It learns from the data and reports an R² score indicating how well it predicts.
from sklearn.ensemble import StackingRegressor
# A list of tuples with the name of the model and the model itself
estimators_list = [
    ('linear_regression', linear),
    ('random_forest', random_forest),
    ('adaboost', adaboost_decision_tree),
    ('adaboost_random_forest', adaboost_random_forest)
]
stacking_regressor = StackingRegressor(estimators=estimators_list, final_estimator=RandomForestRegressor(n_estimators=50, max_depth=25, n_jobs=-1))
stacking_regressor.fit(X_train, y_train)
stack_regressor_score = stacking_regressor.score(X_test, y_test)
print("R² score:", stack_regressor_score)
R² score: 0.9518657489977174
This prints a comparison report displaying each model's score (accuracy for the classifiers, R² for the regressors, so the numbers are not strictly comparable). Finally, it identifies the best-performing model by taking the highest score and prints its name along with the score.
# List of models and their scores
model_scores = {
    "Linear Regression": linear_score,
    "Decision Tree": DT_score,
    "Random Forest": rf_score,
    "K-Nearest Neighbors": KNN_score,
    "Support Vector Machine (SVM)": SVM_score,
    "Decision Tree with AdaBoost": ada_dt_score,
    "Random Forest with AdaBoost": ada_rf_score,
    "Stacking Regressor": stack_regressor_score
}
# Print comparison report
print("Model Comparison Report:")
print("=========================")
for model, score in model_scores.items():
    print(f"{model}: {score:.4f}")
# Find the best performing model
best_model = max(model_scores, key=model_scores.get)
print(f"\nThe best performing model is: {best_model} with a score of {model_scores[best_model]:.4f}")
Model Comparison Report:
=========================
Linear Regression: 0.6846
Decision Tree: 0.9313
Random Forest: 0.9399
K-Nearest Neighbors: 0.9025
Support Vector Machine (SVM): 0.8351
Decision Tree with AdaBoost: 0.9444
Random Forest with AdaBoost: 0.9576
Stacking Regressor: 0.9519

The best performing model is: Random Forest with AdaBoost with a score of 0.9576
from sklearn.model_selection import GridSearchCV
# Define hyperparameters grid for each model
param_grid_linear = {
    # Define hyperparameters for Linear Regression if needed
}
param_grid_decision_tree = {
    # Define hyperparameters for Decision Tree
    'min_samples_leaf': [10, 20, 30, 40],
    'min_samples_split': [100, 200, 300, 400]
}
param_grid_random_forest = {
    # Define hyperparameters for Random Forest
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50]
}
param_grid_knn = {
    # Define hyperparameters for KNN
    'n_neighbors': [3, 5, 7, 9]
}
param_grid_svm = {
    # Define hyperparameters for SVM
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['linear', 'rbf', 'poly']
}
# Define GridSearchCV for each model
grid_search_linear = GridSearchCV(LinearRegression(), param_grid_linear, cv=5)
grid_search_decision_tree = GridSearchCV(DecisionTreeClassifier(), param_grid_decision_tree, cv=5)
grid_search_random_forest = GridSearchCV(RandomForestRegressor(), param_grid_random_forest, cv=5)
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5)
grid_search_svm = GridSearchCV(SVC(), param_grid_svm, cv=5)
# Perform grid search for each model
grid_search_linear.fit(X_train, y_train)
grid_search_decision_tree.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)
grid_search_knn.fit(X_train, y_train)
grid_search_svm.fit(X_train, y_train)
# Print best hyperparameters for each model
print("Best hyperparameters for Linear Regression:", grid_search_linear.best_params_)
print("Best hyperparameters for Decision Tree:", grid_search_decision_tree.best_params_)
print("Best hyperparameters for Random Forest:", grid_search_random_forest.best_params_)
print("Best hyperparameters for KNN:", grid_search_knn.best_params_)
print("Best hyperparameters for SVM:", grid_search_svm.best_params_)
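Once a search finishes, `best_estimator_` has already been refit on the full training data, so the tuned model can be scored on the held-out set directly. A self-contained sketch on synthetic data (the names mirror the cells above, but the data does not):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_toy = rng.random((100, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5]}, cv=5)
search.fit(X_tr, y_tr)
test_score = search.best_estimator_.score(X_te, y_te)  # held-out accuracy of the tuned model
print(search.best_params_, round(test_score, 3))
```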